WEEK 5: SPECIAL DATA TYPES

Monday, February 6th

Today we will…

Final Project Group Formation

You will be completing a final project in Stat 331/531 in teams of four.

  • Group Formation Survey due Friday, 2/9 at 11:59pm
    • Help me gather information about your preferences and work styles to facilitate team formation.
    • Your team members do not all need to be in the same section, but you might find it useful for in-class work time.
  • Group Contracts
  • Project Proposal
  • Final Deliverable

Date + Time Variables

Why are dates and times tricky?

When parsing dates and times, we have to consider complicating factors like…

  • Daylight Saving Time.
    • One day a year is 23 hours; one day a year is 25 hours.
    • Some places use it, some don’t.
  • Leap years – most years have 365 days, some have 366.
  • Time zones.
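A quick illustration of two of these wrinkles, using the lubridate package introduced next (the 2023 US "spring forward" date, March 12, is an assumption specific to US time zones):

```r
library(lubridate)

leap_year(2020)  # TRUE  -- 2020 had 366 days
leap_year(2023)  # FALSE -- 2023 had 365 days

# The US "spring forward" day in 2023 (March 12) was only 23 hours long:
as.duration(
  ymd_hms("2023-03-13 00:00:00", tz = "America/Los_Angeles") -
  ymd_hms("2023-03-12 00:00:00", tz = "America/Los_Angeles")
)
# [1] "82800s (~23 hours)"
```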

lubridate

Common Tasks

  • Convert a date-like variable (“May 8, 1995”) to a date or date-time object.

  • Find the weekday, month, year, etc. from a date-time object.

  • Convert between time zones.

Note

The lubridate package installs with the tidyverse, but does not load automatically — you must load it yourself.

library(lubridate)

date-time Objects

There are multiple data types for dates and times.

  • A date:
    • date or Date
  • A date and a time (identifies a unique instant in time):
    • dttm
    • POSIXct – stores date-times as the number of seconds since January 1, 1970 (the “Unix Epoch”)
    • POSIXlt – stores date-times as a list with elements for second, minute, hour, day, month, year, etc.
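You can see the difference between the two storage schemes directly (a small sketch; the tidyverse parsers return POSIXct by default):

```r
library(lubridate)

x <- ymd_hms("1995-05-08 09:32:12", tz = "UTC")
class(x)       # "POSIXct" "POSIXt"
unclass(x)     # a single number: seconds since 1970-01-01

y <- as.POSIXlt(x)   # same instant, stored as a list of components
y$year         # 95  (years since 1900)
y$mon          # 4   (months are 0-indexed)
y$mday         # 8
```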

Creating date-time Objects

Create a date from individual components:

make_date(year = 1995, month = 05, day = 08)
[1] "1995-05-08"

Create a date from a string:

mdy("May 8, 1995")
[1] "1995-05-08"
dmy("8-May-1995", tz = "America/Chicago")
[1] "1995-05-08 CDT"
dmy_hms("8-May-1995 9:32:12", tz = "America/Chicago")
[1] "1995-05-08 09:32:12 CDT"
as_datetime("95-05-08", format = "%y-%m-%d")
[1] "1995-05-08 UTC"
parse_datetime("5/8/1995", format = "%m/%d/%Y")
[1] "1995-05-08 UTC"

Creating date-time Objects

Common Mistake with Dates

What’s wrong here?

as_datetime(2023-02-6)
[1] "1970-01-01 00:33:35 UTC"


my_date <- 2023-02-6
my_date
[1] 2015


Make sure you use quotes!

  • 2,015 seconds ≈ 33.5 minutes

Extracting date-time Components

bday <- ymd_hms("1995-05-08 9:32:12", tz = "America/Chicago")
bday
[1] "1995-05-08 09:32:12 CDT"


year(bday)
[1] 1995
month(bday)
[1] 5
day(bday)
[1] 8
wday(bday)
[1] 2
wday(bday, label = TRUE, abbr = FALSE)
[1] Monday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Subtraction with date-time Objects

Doing subtraction gives you a difftime object.

  • difftime objects do not always have the same units – it depends on the scale of the objects you are working with.
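For example, subtracting date-times that are close together reports seconds, while subtracting dates reports days:

```r
library(lubridate)

# Seconds apart -> difftime reported in seconds
ymd_hms("2023-03-01 13:00:30") - ymd_hms("2023-03-01 13:00:00")
# Time difference of 30 secs

# Days apart -> difftime reported in days
ymd("2023-03-05") - ymd("2023-03-01")
# Time difference of 4 days
```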

How old am I?

today() - mdy("05/08/1995")
Time difference of 10501 days

How long did it take me to finish a typing challenge?

begin  <- mdy_hms("3/1/2023 13:04:34")
finish <- mdy_hms("3/1/2023 13:06:11")
finish - begin
Time difference of 1.616667 mins

Durations and Periods

Durations will always give the time span in an exact number of seconds.

as.duration(today() - mdy("05/08/1995"))
[1] "907286400s (~28.75 years)"
as.duration(finish - begin)
[1] "97s (~1.62 minutes)"

Periods will give the time span in more approximate, but human readable times.

as.period(today() - mdy("05/08/1995"))
[1] "10501d 0H 0M 0S"
as.period(finish - begin)
[1] "1M 37S"

Durations and Periods

We can also add time:

  • days(), years(), etc. will add a period of time.
  • ddays(), dyears(), etc. will add a duration of time.

Because durations use the exact number of seconds to represent days and years, you might get unexpected results:

When is my 99th birthday?

mdy("05/08/1995") + years(99)
[1] "2094-05-08"
mdy("05/08/1995") + dyears(99)
[1] "2094-05-07 18:00:00 UTC"

Time Zones

Time zones are complicated!

Specify time zones in the form:

  • {continent}/{city} – “America/Chicago”, “Africa/Nairobi”
  • {ocean}/{city} – “Pacific/Auckland”

What time zone does R think I’m in?

Sys.timezone()
[1] "America/Los_Angeles"

Time Zones (Can Taylor Swift make it to the Superbowl?)

You can change the time zone of a date in two ways:

x <- ymd_hms("2024-02-11 18:00:00", tz = "Asia/Tokyo")

Keeps the instant in time the same, but changes the visual representation.

x |> 
  with_tz()
[1] "2024-02-11 01:00:00 PST"
x |> 
  with_tz(tzone = "America/Los_Angeles")
[1] "2024-02-11 01:00:00 PST"

Changes the instant in time by forcing a time zone change.

x |> 
  force_tz()
[1] "2024-02-11 18:00:00 PST"
x |> 
  force_tz(tzone = "America/Los_Angeles")
[1] "2024-02-11 18:00:00 PST"

Common Mistake with Dates

When you read data in or create a new date-time object, the default time zone (if not specified) is UTC.

  • UTC (Coordinated Universal Time) is effectively the same as GMT (Greenwich Mean Time).

Make sure you specify your desired time zone!

x <- mdy("05/08/1995")
tz(x)
[1] "UTC"
x <- mdy("05/08/1995", tz = "America/Chicago")
tz(x)
[1] "America/Chicago"

PA 5.1: Zodiac Killer

One of the most famous mysteries in California history is the identity of the so-called “Zodiac Killer”, who murdered 7 people in Northern California between 1968 and 1969. A new murder was committed last year in California, suspected to be the work of a new Zodiac Killer on the loose.

Unfortunately, the date and time of the murder is not known. You have been hired to crack the case. Use the clues below to discover the murderer’s identity.

Submit the name of the killer to the Canvas Quiz.

To do…

  • PA 5.1: Zodiac Killer
    • Due Thursday, 2/8 at 8:00am
  • Final Project Group Formation Survey
    • Due Friday, 2/9 at 11:59pm
  • Lab 5: Murder Mystery in SQL City
    • Due Monday 2/12 at 11:59pm

Thursday, February 8th

Today we will…

  • Review
    • PA 5.1: Zodiac Killer
    • Lab 4: Childcare Costs
  • Midterm Exam Thursday, 2/15: What to Expect
  • New Material
    • Strings
    • Regular Expressions
  • Example: “Messy” Covid Variants
  • PA 5.2: Scrambled Message

Midterm Exam – Thursday, 2/15

  • This is a three-part exam:
    1. You will first complete a General Questions section on paper and without your computer.
    2. After you turn that in, you will complete a Short Answer section with your computer.
       • You will have the one hour and 50 minute class period to complete the first two sections.
    3. The third section, Open-Ended Analysis, will be started in class and due 24 hours after the end of class.

Midterm Exam – Thursday, 2/15

  • The exam is worth a total of 100 points.
    • Approx. 20 pts, 30 pts, and 50 pts for the three sections.
  • I will provide a .qmd template for the Short Answer.
  • You will create your own .qmd for the Open-Ended Analysis. You are encouraged to create this ahead of time.

Caution

While the coding tasks are open-resource, you will likely run out of time if you have to look everything up. Know what functions you might need and where to find documentation for implementing these functions.

stringr

strings

A string is a bunch of characters.

Don’t confuse a string (many characters, one object) with a character vector (vector of strings).


my_string <- "Hi, my name is Bond!"
my_vector <- c("Hi", "my", "name", "is", "Bond")


my_string
[1] "Hi, my name is Bond!"


my_vector
[1] "Hi"   "my"   "name" "is"   "Bond"

stringr

Common tasks

  • Find which strings contain a particular pattern

  • Remove or replace a pattern

  • Edit a string (for example, make it lowercase)

Note

The package stringr is very useful for strings!

  • stringr loads with the tidyverse.

  • All of its functions are named str_xxx().

pattern =

The pattern argument appears in all of the stringr functions:

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

str_detect(my_vector, pattern = "Bond")
str_locate(my_vector, pattern = "Bond")
str_match(my_vector, pattern = "Bond")
str_extract(my_vector, pattern = "Bond")
str_subset(my_vector, pattern = "Bond")

Note

Discuss with a neighbor. For each of these functions, give:

  • The object structure of the output.
  • The data type of the output.
  • A brief explanation of what they do.

str_detect()

Returns a logical vector (TRUE/FALSE) indicating whether the pattern was found in each element of the original vector.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_detect(my_vector, pattern = "Bond")
[1] FALSE FALSE  TRUE  TRUE
  • Pairs well with filter()
  • Could be used with summarise() and sum or mean

Related functions

str_subset() returns just the strings that contain the match

str_which() returns the indexes of strings that have a match
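A quick comparison of the three on the running example:

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

str_detect(my_vector, pattern = "Bond")  # [1] FALSE FALSE  TRUE  TRUE
str_which(my_vector, pattern = "Bond")   # [1] 3 4
str_subset(my_vector, pattern = "Bond")  # [1] "Bond"       "James Bond"
```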

str_match()

Returns a character matrix with either NA or the pattern, depending on whether the pattern was found.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_match(my_vector, pattern = "Bond")
     [,1]  
[1,] NA    
[2,] NA    
[3,] "Bond"
[4,] "Bond"

str_extract()

Returns a character vector with either NA or the pattern, depending on whether the pattern was found.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_extract(my_vector, pattern = "Bond")
[1] NA     NA     "Bond" "Bond"

Warning

str_extract() only returns the first pattern match.

Use str_extract_all() to return every pattern match.
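For example, when an element contains the pattern more than once:

```r
library(stringr)

x <- c("Bond", "Bond, James Bond")

str_extract(x, pattern = "Bond")
# [1] "Bond" "Bond"        <- only the first match in each element

str_extract_all(x, pattern = "Bond")
# A list: one character vector of matches per element
# [[1]] "Bond"
# [[2]] "Bond" "Bond"
```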

str_locate()

Returns an integer matrix with two columns, start and end, giving either NA or the start and end positions of the pattern.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_locate(my_vector, pattern = "Bond")
     start end
[1,]    NA  NA
[2,]    NA  NA
[3,]     1   4
[4,]     7  10

str_subset()

Returns a character vector containing only the elements of the original vector where the pattern occurs.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_subset(my_vector, pattern = "Bond")
[1] "Bond"       "James Bond"

Related Functions

str_sub() extracts values based on location.
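For instance, str_sub() pulls out characters by position rather than by pattern:

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

str_sub(my_vector, start = 1, end = 2)
# [1] "He" "my" "Bo" "Ja"

str_sub(my_vector, start = -4)   # negative positions count from the end
# [1] "llo," "e is" "Bond" "Bond"
```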

Replace / Remove patterns

str_replace() replaces the first matched pattern.

  • Pairs well with mutate()
str_replace(my_vector, pattern = "Bond", replacement = "Franco")
[1] "Hello,"       "my name is"   "Franco"       "James Franco"

str_remove() removes the first matched pattern.

Special case – str_replace(x, pattern, replacement = "")

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_remove(my_vector, pattern = "Bond")
[1] "Hello,"     "my name is" ""           "James "    

Related functions

str_replace_all() replaces all matched patterns

str_remove_all() removes all matched patterns
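A small comparison of first-match vs. all-match behavior:

```r
library(stringr)

shells <- "She sells seashells by the seashore"

str_replace(shells, pattern = "sea", replacement = "SEA")
# [1] "She sells SEAshells by the seashore"

str_replace_all(shells, pattern = "sea", replacement = "SEA")
# [1] "She sells SEAshells by the SEAshore"

str_remove_all(shells, pattern = "sea")
# [1] "She sells shells by the shore"
```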

Make edits

Convert letters in the string to a specific capitalization format.

converts all letters in the strings to lowercase


my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_to_lower(my_vector)
[1] "hello,"     "my name is" "bond"       "james bond"

converts all letters in the strings to uppercase


str_to_upper(my_vector)
[1] "HELLO,"     "MY NAME IS" "BOND"       "JAMES BOND"

converts the first letter of each word in the strings to uppercase


str_to_title(my_vector)
[1] "Hello,"     "My Name Is" "Bond"       "James Bond"

Combine Strings

str_c() joins multiple strings into a single string.

prompt <- "Hello, my name is"
first  <- "James"
last   <- "Bond"
str_c(prompt, last, ",", first, last, sep = " ")
[1] "Hello, my name is Bond , James Bond"

Note

Similar to paste() and paste0()
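One difference worth knowing: str_c() propagates missing values, while paste() converts them to the text "NA":

```r
library(stringr)

str_c("James", NA, sep = " ")
# [1] NA

paste("James", NA)
# [1] "James NA"
```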

str_flatten() combines a character vector into a single string.

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_flatten(my_vector, collapse = " ")
[1] "Hello, my name is Bond James Bond"

Note

str_c() (with collapse =) will do the same thing, but it is encouraged to use str_flatten() instead.

str_glue() uses the environment to create a string, evaluating the R code inside {braces}.

first <- "James"
last <- "Bond"
str_glue("My name is {last}, {first} {last}")
My name is Bond, James Bond

Tip

See the R package glue!

Hints and Tips for Success

  • Refer to the stringr cheatsheet

  • Remember that str_xxx functions need the first argument to be a vector of strings, not a data set.

    • You might want to use them inside functions like filter() or mutate().
cereal |> 
  mutate(
    is_bran = str_detect(name, "Bran"), 
    .after = name
  )
                                     name is_bran manuf type calories protein
1                               100% Bran    TRUE     N cold       70       4
2                       100% Natural Bran    TRUE     Q cold      120       3
3                                All-Bran    TRUE     K cold       70       4
4               All-Bran with Extra Fiber    TRUE     K cold       50       4
5                          Almond Delight   FALSE     R cold      110       2
6                 Apple Cinnamon Cheerios   FALSE     G cold      110       2
7                             Apple Jacks   FALSE     K cold      110       2
8                                 Basic 4   FALSE     G cold      130       3
9                               Bran Chex    TRUE     R cold       90       2
10                            Bran Flakes    TRUE     P cold       90       3
11                           Cap'n'Crunch   FALSE     Q cold      120       1
12                               Cheerios   FALSE     G cold      110       6
13                  Cinnamon Toast Crunch   FALSE     G cold      120       1
14                               Clusters   FALSE     G cold      110       3
15                            Cocoa Puffs   FALSE     G cold      110       1
16                              Corn Chex   FALSE     R cold      110       2
17                            Corn Flakes   FALSE     K cold      100       2
18                              Corn Pops   FALSE     K cold      110       1
19                          Count Chocula   FALSE     G cold      110       1
20                     Cracklin' Oat Bran    TRUE     K cold      110       3
21                 Cream of Wheat (Quick)   FALSE     N  hot      100       3
22                                Crispix   FALSE     K cold      110       2
23                 Crispy Wheat & Raisins   FALSE     G cold      100       2
24                            Double Chex   FALSE     R cold      100       2
25                            Froot Loops   FALSE     K cold      110       2
26                         Frosted Flakes   FALSE     K cold      110       1
27                    Frosted Mini-Wheats   FALSE     K cold      100       3
28 Fruit & Fibre Dates; Walnuts; and Oats   FALSE     P cold      120       3
29                          Fruitful Bran    TRUE     K cold      120       3
30                         Fruity Pebbles   FALSE     P cold      110       1
31                           Golden Crisp   FALSE     P cold      100       2
32                         Golden Grahams   FALSE     G cold      110       1
33                      Grape Nuts Flakes   FALSE     P cold      100       3
34                             Grape-Nuts   FALSE     P cold      110       3
35                     Great Grains Pecan   FALSE     P cold      120       3
36                       Honey Graham Ohs   FALSE     Q cold      120       1
37                     Honey Nut Cheerios   FALSE     G cold      110       3
38                             Honey-comb   FALSE     P cold      110       1
39            Just Right Crunchy  Nuggets   FALSE     K cold      110       2
40                 Just Right Fruit & Nut   FALSE     K cold      140       3
41                                    Kix   FALSE     G cold      110       2
42                                   Life   FALSE     Q cold      100       4
43                           Lucky Charms   FALSE     G cold      110       2
44                                  Maypo   FALSE     A  hot      100       4
45       Muesli Raisins; Dates; & Almonds   FALSE     R cold      150       4
46      Muesli Raisins; Peaches; & Pecans   FALSE     R cold      150       4
47                   Mueslix Crispy Blend   FALSE     K cold      160       3
48                   Multi-Grain Cheerios   FALSE     G cold      100       2
49                       Nut&Honey Crunch   FALSE     K cold      120       2
50              Nutri-Grain Almond-Raisin   FALSE     K cold      140       3
51                      Nutri-grain Wheat   FALSE     K cold       90       3
52                   Oatmeal Raisin Crisp   FALSE     G cold      130       3
53                  Post Nat. Raisin Bran    TRUE     P cold      120       3
54                             Product 19   FALSE     K cold      100       3
55                            Puffed Rice   FALSE     Q cold       50       1
56                           Puffed Wheat   FALSE     Q cold       50       2
57                     Quaker Oat Squares   FALSE     Q cold      100       4
58                         Quaker Oatmeal   FALSE     Q  hot      100       5
59                            Raisin Bran    TRUE     K cold      120       3
60                        Raisin Nut Bran    TRUE     G cold      100       3
61                         Raisin Squares   FALSE     K cold       90       2
62                              Rice Chex   FALSE     R cold      110       1
63                          Rice Krispies   FALSE     K cold      110       2
64                         Shredded Wheat   FALSE     N cold       80       2
65                 Shredded Wheat 'n'Bran    TRUE     N cold       90       3
66              Shredded Wheat spoon size   FALSE     N cold       90       3
67                                 Smacks   FALSE     K cold      110       2
68                              Special K   FALSE     K cold      110       6
69                Strawberry Fruit Wheats   FALSE     N cold       90       2
70                      Total Corn Flakes   FALSE     G cold      110       2
71                      Total Raisin Bran    TRUE     G cold      140       3
72                      Total Whole Grain   FALSE     G cold      100       3
73                                Triples   FALSE     G cold      110       2
74                                   Trix   FALSE     G cold      110       1
75                             Wheat Chex   FALSE     R cold      100       3
76                               Wheaties   FALSE     G cold      100       3
77                    Wheaties Honey Gold   FALSE     G cold      110       2
   fat sodium fiber carbo sugars potass vitamins shelf weight cups   rating
1    1    130  10.0   5.0      6    280       25     3   1.00 0.33 68.40297
2    5     15   2.0   8.0      8    135        0     3   1.00 1.00 33.98368
3    1    260   9.0   7.0      5    320       25     3   1.00 0.33 59.42551
4    0    140  14.0   8.0      0    330       25     3   1.00 0.50 93.70491
5    2    200   1.0  14.0      8     -1       25     3   1.00 0.75 34.38484
6    2    180   1.5  10.5     10     70       25     1   1.00 0.75 29.50954
7    0    125   1.0  11.0     14     30       25     2   1.00 1.00 33.17409
8    2    210   2.0  18.0      8    100       25     3   1.33 0.75 37.03856
9    1    200   4.0  15.0      6    125       25     1   1.00 0.67 49.12025
10   0    210   5.0  13.0      5    190       25     3   1.00 0.67 53.31381
11   2    220   0.0  12.0     12     35       25     2   1.00 0.75 18.04285
12   2    290   2.0  17.0      1    105       25     1   1.00 1.25 50.76500
13   3    210   0.0  13.0      9     45       25     2   1.00 0.75 19.82357
14   2    140   2.0  13.0      7    105       25     3   1.00 0.50 40.40021
15   1    180   0.0  12.0     13     55       25     2   1.00 1.00 22.73645
16   0    280   0.0  22.0      3     25       25     1   1.00 1.00 41.44502
17   0    290   1.0  21.0      2     35       25     1   1.00 1.00 45.86332
18   0     90   1.0  13.0     12     20       25     2   1.00 1.00 35.78279
19   1    180   0.0  12.0     13     65       25     2   1.00 1.00 22.39651
20   3    140   4.0  10.0      7    160       25     3   1.00 0.50 40.44877
21   0     80   1.0  21.0      0     -1        0     2   1.00 1.00 64.53382
22   0    220   1.0  21.0      3     30       25     3   1.00 1.00 46.89564
23   1    140   2.0  11.0     10    120       25     3   1.00 0.75 36.17620
24   0    190   1.0  18.0      5     80       25     3   1.00 0.75 44.33086
25   1    125   1.0  11.0     13     30       25     2   1.00 1.00 32.20758
26   0    200   1.0  14.0     11     25       25     1   1.00 0.75 31.43597
27   0      0   3.0  14.0      7    100       25     2   1.00 0.80 58.34514
28   2    160   5.0  12.0     10    200       25     3   1.25 0.67 40.91705
29   0    240   5.0  14.0     12    190       25     3   1.33 0.67 41.01549
30   1    135   0.0  13.0     12     25       25     2   1.00 0.75 28.02576
31   0     45   0.0  11.0     15     40       25     1   1.00 0.88 35.25244
32   1    280   0.0  15.0      9     45       25     2   1.00 0.75 23.80404
33   1    140   3.0  15.0      5     85       25     3   1.00 0.88 52.07690
34   0    170   3.0  17.0      3     90       25     3   1.00 0.25 53.37101
35   3     75   3.0  13.0      4    100       25     3   1.00 0.33 45.81172
36   2    220   1.0  12.0     11     45       25     2   1.00 1.00 21.87129
37   1    250   1.5  11.5     10     90       25     1   1.00 0.75 31.07222
38   0    180   0.0  14.0     11     35       25     1   1.00 1.33 28.74241
39   1    170   1.0  17.0      6     60      100     3   1.00 1.00 36.52368
40   1    170   2.0  20.0      9     95      100     3   1.30 0.75 36.47151
41   1    260   0.0  21.0      3     40       25     2   1.00 1.50 39.24111
42   2    150   2.0  12.0      6     95       25     2   1.00 0.67 45.32807
43   1    180   0.0  12.0     12     55       25     2   1.00 1.00 26.73451
44   1      0   0.0  16.0      3     95       25     2   1.00 1.00 54.85092
45   3     95   3.0  16.0     11    170       25     3   1.00 1.00 37.13686
46   3    150   3.0  16.0     11    170       25     3   1.00 1.00 34.13976
47   2    150   3.0  17.0     13    160       25     3   1.50 0.67 30.31335
48   1    220   2.0  15.0      6     90       25     1   1.00 1.00 40.10596
49   1    190   0.0  15.0      9     40       25     2   1.00 0.67 29.92429
50   2    220   3.0  21.0      7    130       25     3   1.33 0.67 40.69232
51   0    170   3.0  18.0      2     90       25     3   1.00 1.00 59.64284
52   2    170   1.5  13.5     10    120       25     3   1.25 0.50 30.45084
53   1    200   6.0  11.0     14    260       25     3   1.33 0.67 37.84059
54   0    320   1.0  20.0      3     45      100     3   1.00 1.00 41.50354
55   0      0   0.0  13.0      0     15        0     3   0.50 1.00 60.75611
56   0      0   1.0  10.0      0     50        0     3   0.50 1.00 63.00565
57   1    135   2.0  14.0      6    110       25     3   1.00 0.50 49.51187
58   2      0   2.7  -1.0     -1    110        0     1   1.00 0.67 50.82839
59   1    210   5.0  14.0     12    240       25     2   1.33 0.75 39.25920
60   2    140   2.5  10.5      8    140       25     3   1.00 0.50 39.70340
61   0      0   2.0  15.0      6    110       25     3   1.00 0.50 55.33314
62   0    240   0.0  23.0      2     30       25     1   1.00 1.13 41.99893
63   0    290   0.0  22.0      3     35       25     1   1.00 1.00 40.56016
64   0      0   3.0  16.0      0     95        0     1   0.83 1.00 68.23588
65   0      0   4.0  19.0      0    140        0     1   1.00 0.67 74.47295
66   0      0   3.0  20.0      0    120        0     1   1.00 0.67 72.80179
67   1     70   1.0   9.0     15     40       25     2   1.00 0.75 31.23005
68   0    230   1.0  16.0      3     55       25     1   1.00 1.00 53.13132
69   0     15   3.0  15.0      5     90       25     2   1.00 1.00 59.36399
70   1    200   0.0  21.0      3     35      100     3   1.00 1.00 38.83975
71   1    190   4.0  15.0     14    230      100     3   1.50 1.00 28.59278
72   1    200   3.0  16.0      3    110      100     3   1.00 1.00 46.65884
73   1    250   0.0  21.0      3     60       25     3   1.00 0.75 39.10617
74   1    140   0.0  13.0     12     25       25     2   1.00 1.00 27.75330
75   1    230   3.0  17.0      3    115       25     1   1.00 0.67 49.78744
76   1    200   3.0  17.0      3    110       25     1   1.00 1.00 51.59219
77   1    200   1.0  16.0      8     60       25     1   1.00 0.75 36.18756

regex

Regular Expressions

“Regexps are a very terse language that allow you to describe patterns in strings.”

R for Data Science

R uses “extended” regular expressions, which are common.

str_detect(string  = my_string_vector, 
           pattern = "REGULAR EXPRESSION"
           )

Web app to test R regular expressions

Tip

Regular expressions are a reason to use stringr!

You might encounter gsub(), grep(), etc. from Base R.

Meta Characters . ^ $ \ | * + ? { } [ ] ( )

toung_twister <- c("She", "sells", "seashells", "by", "the", "seashore!")
toung_twister
[1] "She"       "sells"     "seashells" "by"        "the"       "seashore!"


. Represents any character

str_subset(toung_twister, pattern = ".ells")
[1] "sells"     "seashells"
toung_twister <- c("She", "sells", "seashells", "by", "the", "seashore!")
toung_twister
[1] "She"       "sells"     "seashells" "by"        "the"       "seashore!"


^ Looks at the beginning

str_subset(toung_twister, pattern = "^s")
[1] "sells"     "seashells" "seashore!"

$ Looks at the end

str_subset(toung_twister, pattern = "s$")
[1] "sells"     "seashells"
shells_str <- c("shes", "shels", "shells", "shellls", "shelllls")
shells_str
[1] "shes"     "shels"    "shells"   "shellls"  "shelllls"


? Occurs 0 or 1 times

str_subset(shells_str, pattern = "shel?s")
[1] "shes"  "shels"

+ Occurs 1 or more times

str_subset(shells_str, pattern = "shel+s")
[1] "shels"    "shells"   "shellls"  "shelllls"

* Occurs 0 or more times

str_subset(shells_str, pattern = "shel*s")
[1] "shes"     "shels"    "shells"   "shellls"  "shelllls"
shells_str <- c("shes", "shels", "shells", "shellls", "shelllls")
shells_str
[1] "shes"     "shels"    "shells"   "shellls"  "shelllls"


{n} matches exactly n times.

str_subset(shells_str, pattern = "shel{2}s")
[1] "shells"

{n,} matches at least n times.

str_subset(shells_str, pattern = "shel{2,}s")
[1] "shells"   "shellls"  "shelllls"

{n,m} matches between n and m times.

str_subset(shells_str, pattern = "shel{1,3}s")
[1] "shels"   "shells"  "shellls"

Groups ()

Groups can be created with ( )

| – “either” / “or”


toung_twister2 <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
toung_twister2
[1] "Peter"    "Piper"    "picked"   "a"        "peck"     "of"       "pickled" 
[8] "peppers!"


str_subset(toung_twister2, pattern = "p(e|i)ck")
[1] "picked"  "peck"    "pickled"

Character Classes []

toung_twister2 <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
str_subset(toung_twister2, pattern = "p[ei]ck")
[1] "picked"  "peck"    "pickled"

[^ ] except - think “not”

str_subset(toung_twister2, pattern = "p[^i]ck")
[1] "peck"

[ - ] range

str_subset(toung_twister2, pattern = "p[ei]ck[a-z]")
[1] "picked"  "pickled"

[Pp] Capitalization matters

str_subset(toung_twister2, pattern = "^p")
[1] "picked"   "peck"     "pickled"  "peppers!"
str_subset(toung_twister2, pattern = "^[Pp]")
[1] "Peter"    "Piper"    "picked"   "peck"     "pickled"  "peppers!"

[] Character Classes

  • [A-Z] matches any capital letter.
  • [a-z] matches any lowercase letter.
  • [A-Za-z] or [[:alpha:]] matches any letter. (Careful: [A-z] also matches the punctuation characters between Z and a in the ASCII table!)
  • [0-9] or [[:digit:]] matches any digit.
  • See the stringr cheatsheet for more shortcuts, like [[:punct:]]
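A quick check of these shortcuts (note that classes like [:alpha:] go inside another set of brackets):

```r
library(stringr)

test <- c("abc", "ABC", "123", "?!")

str_detect(test, "[[:digit:]]")   # [1] FALSE FALSE  TRUE FALSE
str_detect(test, "[A-Za-z]")      # [1]  TRUE  TRUE FALSE FALSE
str_detect(test, "[[:punct:]]")   # [1] FALSE FALSE FALSE  TRUE
```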

\w matches any “word” character (conversely, \W matches any non-word character)

\d matches any digit (conversely, \D matches any non-digit)

\s matches any whitespace (conversely, \S matches any non-whitespace)
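In R strings the backslash itself must be escaped, so these are written "\\w", "\\d", "\\s":

```r
library(stringr)

x <- c("Stat 331", "regex!", "   ")

str_extract(x, "\\d+")   # first run of digits:          "331"  NA      NA
str_extract(x, "\\w+")   # first run of word characters: "Stat" "regex" NA
str_detect(x, "\\s")     # contains whitespace?           TRUE  FALSE   TRUE
```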

Let’s try it out!

Discuss with a neighbor which regular expressions would search for words that do the following:

  • end with a vowel
  • start with x, y, or z
  • do not contain x, y, or z
  • contain British spelling

Test your answers out on

test_vec <- c("zebra", "xray", "apple", "yellow", "color", "colour", "summarize", "summarise")

Escape \

In order to match a special character you need to “escape” first

Warning

In general, look at punctuation characters with suspicion.

toung_twister3 <- c("How", "much", "wood", "could", "a", "woodchuck", "chuck",
                    "if", "a", "woodchuck", "could", "chuck", "wood?")
toung_twister3
 [1] "How"       "much"      "wood"      "could"     "a"         "woodchuck"
 [7] "chuck"     "if"        "a"         "woodchuck" "could"     "chuck"    
[13] "wood?"    
str_subset(toung_twister3, pattern = "?")
Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`?`)
str_subset(toung_twister3, pattern = "\\?")
[1] "wood?"

Note

Could also use [] character class

str_subset(toung_twister3, pattern = "[?]")
[1] "wood?"

When in Doubt



Use the web app to test R regular expressions

Tips for working with regex

  • Read the regular expressions out loud like a “request”
  • Test out your expressions on small examples first.

str_view()

str_view(c("shes", "shels", "shells", "shellls", "shelllls"), "l+")
[2] │ she<l>s
[3] │ she<ll>s
[4] │ she<lll>s
[5] │ she<llll>s
  • I use the stringr cheatsheet more than any other package cheatsheet!

  • Be kind to yourself when working with regular expressions!

  • Read the regular expressions out loud like a “request”

Strings in the tidyverse

stringr functions + dplyr verbs!

Find countries that start with an “A”:

military |> 
  filter(str_detect(string  = Country, 
                    pattern = "^A"
                    )
         ) |> 
  distinct(Country)
Country
Africa
Algeria
Angola
Americas
Argentina
Asia & Oceania
Afghanistan
Australia
Albania
Armenia
Azerbaijan
Austria

Find the proportion of countries containing a compass direction:

military |> 
  distinct(Country) |> 
  summarize(prop = mean(str_detect(string = Country,
                                   pattern = "[Nn]orth|[Ss]outh|[Ee]ast|[Ww]est"
                                   )
                        )
            )
# A tibble: 1 × 1
    prop
   <dbl>
1 0.0789

matches(pattern)

Selects all variables with a name that matches the supplied pattern

  • pairs well with select(), rename_with(), and across()
military_clean <- military |> 
  mutate(across(`1988`:`2019`, 
                ~ na_if(.x, y = ". .")
                ),
         across(`1988`:`2019`, 
                ~ na_if(.x, y = "xxx")
                )
         )
military_clean <- military |> 
  mutate(across(matches("[0-9]{4}"), 
                ~ na_if(.x, y = ". .")),
         across(matches("[0-9]{4}"), 
                ~ na_if(.x, y = "xxx")))

“Messy” Covid Variants

I received this data from a grad school colleague the other day who asked if I knew how to “clean” it.

What is that column?!

[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21APR-02 (Delta B.1.617.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21OCT-01 (Delta AY 4.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-22DEC-01 (Omicron CH.1.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 24.56}, {'variant': 'V-22JUL-01 (Omicron BA.2.75)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 8.93}, {'variant': 'V-22OCT-01 (Omicron BQ.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 49.57}, {'variant': 'VOC-21NOV-01 (Omicron BA.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.02}, {'variant': 'VOC-22APR-03 (Omicron BA.4)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.08}, {'variant': 'VOC-22APR-04 (Omicron BA.5)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.59}, {'variant': 'VOC-22JAN-01 (Omicron BA.2)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 1.41}, {'variant': 'unclassified_variant', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.26}]

Enter stringr!

Let’s see how this works.
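One possible approach (a sketch, not the in-class solution — it assumes the column is one long string like the sample above, and uses fixed-width lookbehinds, which stringr's ICU regex engine supports):

```r
library(stringr)

messy <- "[{'variant': 'Other', 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'newWeeklyPercentage': 0.0}]"

# Extract every variant name: the characters after 'variant': ' up to the closing quote
str_extract_all(messy, "(?<='variant': ')[^']+")[[1]]
# [1] "Other"              "V-20DEC-01 (Alpha)"

# Extract the percentages as numbers
as.numeric(str_extract_all(messy, "(?<='newWeeklyPercentage': )[0-9.]+")[[1]])
# [1] 4.59 0.00
```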

PA 5.2: Scrambled Message

In this activity, you will be using regular expressions to decode a message.

x <- c("She", "sells", "seashells", "by", "the", "seashore!")
  • Grab elements out of a vector with [].
x[c(1,4,5)]
[1] "She" "by"  "the"
  • To replace those elements, use <- to assign new values.
x[c(1,4,5)] <- ""

To do…

  • PA 5.2: Scrambled Message
    • Due Monday, 2/12 at 8:00am
  • Final Project Group Formation Survey
    • Due Friday, 2/9 at 11:59pm
  • Lab 5: Murder Mystery in SQL City
    • Due Monday 2/12 at 11:59pm
  • Read Chapter 6: Version Control
    • Concept Check 6.1 + 6.2 due Tuesday (2/13) at 8:00am